Crawling JavaScript websites using WebKit – with application to analysis of hate speech in online discussions
Abstract
JavaScript client-side hidden web (CSHW) pages contain dynamic material created as a result of specific user activities. The number of CSHW websites is increasing. Crawling the so-called Hidden Web is challenging, particularly when JavaScript CSHW from an external website is seamlessly included as part of the web pages. We have developed a prototype web crawler that efficiently extracts content from CSHW. The crawler uses WebKit to render web pages and to emulate human activity on those pages in order to reveal dynamic content. The WebKit crawler was used to collect text from 39 Norwegian online newspaper debate articles, where the online user discussions were included as JavaScript CSHW from other websites. The average speeds for extracting the main content and the JavaScript-generated discussions were 36.3 kB/sec and 8.8 kB/sec, respectively. Opinion mining of the text collected from the newspaper debate articles shows that the debate articles are more positive toward Islam and Muslims than the discussions that follow them. The results demonstrate the importance of being able to collect such JavaScript CSHW discussion content to get an overview of existing hate speech on the Internet.
Similar resources
Flexible Access Control for JavaScript (Richards, Hammer, Zappa Nardelli, Jagannathan, Vitek)
Providing security guarantees for systems built out of untrusted components requires the ability to define and enforce access control policies over untrusted code. In Web 2.0 applications, JavaScript code from different origins is often combined on a single page, leading to well-known vulnerabilities. We present a security infrastructure which allows users and content providers to specify acces...
jÄk: Using Dynamic Analysis to Crawl and Test Modern Web Applications
Web application scanners are popular tools to perform black box testing and are widely used to discover bugs in websites. For them to work effectively, they either rely on a set of URLs that they can test, or use their own implementation of a crawler that discovers new parts of a web application. Traditional crawlers would extract new URLs by parsing HTML documents and applying static regular e...
Information Flow Control in WebKit's JavaScript Bytecode
Websites today routinely combine JavaScript from multiple sources, both trusted and untrusted. Hence, JavaScript security is of paramount importance. A specific interesting problem is information flow control (IFC) for JavaScript. In this paper, we develop, formalize and implement a dynamic IFC mechanism for the JavaScript engine of a production Web browser (specifically, Safari’s WebKit engine...
Hate Speech Detection with Comment Embeddings
We address the problem of hate speech detection in online user comments. Hate speech, defined as an “abusive speech targeting specific group characteristics, such as ethnicity, religion, or gender”, is an important problem plaguing websites that allow users to leave feedback, having a negative impact on their online business and overall user experience. We propose to learn distributed low-dimen...
Finding and Emulating Keyboard, Mouse, and Touch Interactions and Gestures while Crawling RIA's
Crawling JavaScript-heavy Rich Internet Applications has been a hot topic in recent years, giving us automated tools for indexing content, test generation, and security and accessibility evaluation, to mention a few examples. However, existing crawling techniques tend to ignore user interactions beyond mouse clicking, and therefore often fail to consider potential mouse, keyboard and touch intera...